Introduction

Blocking is typically done to reduce the number of tuple pairs considered for matching. There are several blocking methods proposed. The py_entitymatching package supports a subset of such blocking methods (#ref to what is supported). One such supported blocker is attribute equivalence blocker. This IPython notebook illustrates how to perform blocking using attribute equivalence blocker.

First, we need to import py_entitymatching package and other libraries as follows:



In [36]:

    
%load_ext autotime



In [1]:

    
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

Then, read the input tablse from the datasets directory



In [2]:

    
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'

# Get the paths of the input tables
path_A = datasets_dir + os.sep + 'person_table_A.csv'
path_B = datasets_dir + os.sep + 'person_table_B.csv'



In [3]:

    
# Read the CSV files and set 'ID' as the key attribute
A = em.read_csv_metadata(path_A, key='ID')
B = em.read_csv_metadata(path_B, key='ID')



In [4]:

    
A.head()









    Out[4]:






  
    
      
      ID
      name
      birth_year
      hourly_wage
      address
      zipcode
    
  
  
    
      0
      a1
      Kevin Smith
      1989
      30.0
      607 From St, San Francisco
      94107
    
    
      1
      a2
      Michael Franklin
      1988
      27.5
      1652 Stockton St, San Francisco
      94122
    
    
      2
      a3
      William Bridge
      1986
      32.0
      3131 Webster St, San Francisco
      94107
    
    
      3
      a4
      Binto George
      1987
      32.5
      423 Powell St, San Francisco
      94122
    
    
      4
      a5
      Alphonse Kemper
      1984
      35.0
      1702 Post Street, San Francisco
      94122



In [5]:

    
B.head()









    Out[5]:






  
    
      
      ID
      name
      birth_year
      hourly_wage
      address
      zipcode
    
  
  
    
      0
      b1
      Mark Levene
      1987
      29.5
      108 Clement St, San Francisco
      94107
    
    
      1
      b2
      Bill Bridge
      1986
      32.0
      3131 Webster St, San Francisco
      94107
    
    
      2
      b3
      Mike Franklin
      1988
      27.5
      1652 Stockton St, San Francisco
      94122
    
    
      3
      b4
      Joseph Kuan
      1982
      26.0
      108 South Park, San Francisco
      94122
    
    
      4
      b5
      Alfons Kemper
      1984
      35.0
      170 Post St, Apt 4,  San Francisco
      94122

Different Ways to Block Using Attribute Equivalence Blocker

Once the tables are read, we can do blocking using attribute equivalence blocker.

There are three different ways to do attribute equivalence blocking:

Block two tables to produce a candidate set of tuple pairs.
Block a candidate set of tuple pairs to typically produce a reduced candidate set of tuple pairs.
Block two tuples to check if a tuple pair would get blocked.

Block Tables to Produce a Candidate Set of Tuple Pairs



In [6]:

    
# Instantiate attribute equivalence blocker object
ab = em.AttrEquivalenceBlocker()

For the given two tables, we will assume that two persons with different zipcode values do not refer to the same real world person. So, we apply attribute equivalence blocking on zipcode. That is, we block all the tuple pairs that have different zipcodes.



In [7]:

    
# Use block_tables to apply blocking over two input tables.
C1 = ab.block_tables(A, B, 
                    l_block_attr='zipcode', r_block_attr='zipcode', 
                    l_output_attrs=['name', 'birth_year', 'zipcode'],
                    r_output_attrs=['name', 'birth_year', 'zipcode'],
                    l_output_prefix='l_', r_output_prefix='r_')



In [8]:

    
# Display the candidate set of tuple pairs
C1.head()









    Out[8]:






  
    
      
      _id
      l_ID
      r_ID
      l_name
      l_birth_year
      l_zipcode
      r_name
      r_birth_year
      r_zipcode
    
  
  
    
      0
      0
      a1
      b1
      Kevin Smith
      1989
      94107
      Mark Levene
      1987
      94107
    
    
      1
      1
      a1
      b2
      Kevin Smith
      1989
      94107
      Bill Bridge
      1986
      94107
    
    
      2
      2
      a1
      b6
      Kevin Smith
      1989
      94107
      Michael Brodie
      1987
      94107
    
    
      3
      3
      a3
      b1
      William Bridge
      1986
      94107
      Mark Levene
      1987
      94107
    
    
      4
      4
      a3
      b2
      William Bridge
      1986
      94107
      Bill Bridge
      1986
      94107

Note that the tuple pairs in the candidate set have the same zipcode.

The attributes included in the candidate set are based on l_output_attrs and r_output_attrs mentioned in block_tables command (the key columns are included by default). Specifically, the list of attributes mentioned in l_output_attrs are picked from table A and the list of attributes mentioned in r_output_attrs are picked from table B. The attributes in the candidate set are prefixed based on l_output_prefix and r_ouptut_prefix parameter values mentioned in block_tables command.



In [9]:

    
# Show the metadata of C1
em.show_properties(C1)









    



id: 4565680872
fk_rtable: r_ID
rtable(obj.id): 4565204272
fk_ltable: l_ID
key: _id
ltable(obj.id): 4565203432



In [10]:

    
id(A), id(B)









    Out[10]:





(4565203432, 4565204272)

Note that the metadata of C1 includes key, foreign key to the left and right tables (i.e A and B) and pointers to left and right tables.

Handling Missing Values

If the input tuples have missing values in the blocking attribute, then they are ignored by default. This is because, including all possible tuple pairs with missing values can significantly increase the size of the candidate set. But if you want to include them, then you can set allow_missing paramater to be True.



In [11]:

    
# Introduce some missing values
A1 = em.read_csv_metadata(path_A, key='ID')
A1.ix[0, 'zipcode'] = pd.np.NaN
A1.ix[0, 'birth_year'] = pd.np.NaN



In [12]:

    
A1









    Out[12]:






  
    
      
      ID
      name
      birth_year
      hourly_wage
      address
      zipcode
    
  
  
    
      0
      a1
      Kevin Smith
      NaN
      30.0
      607 From St, San Francisco
      NaN
    
    
      1
      a2
      Michael Franklin
      1988.0
      27.5
      1652 Stockton St, San Francisco
      94122.0
    
    
      2
      a3
      William Bridge
      1986.0
      32.0
      3131 Webster St, San Francisco
      94107.0
    
    
      3
      a4
      Binto George
      1987.0
      32.5
      423 Powell St, San Francisco
      94122.0
    
    
      4
      a5
      Alphonse Kemper
      1984.0
      35.0
      1702 Post Street, San Francisco
      94122.0



In [13]:

    
# Use block_tables to apply blocking over two input tables.
C2 = ab.block_tables(A1, B, 
                    l_block_attr='zipcode', r_block_attr='zipcode', 
                    l_output_attrs=['name', 'birth_year', 'zipcode'],
                    r_output_attrs=['name', 'birth_year', 'zipcode'],
                    l_output_prefix='l_', r_output_prefix='r_', 
                    allow_missing=True) # setting allow_missing parameter to True



In [14]:

    
len(C1), len(C2)









    Out[14]:





(15, 18)



In [15]:

    
C2









    Out[15]:






  
    
      
      _id
      l_ID
      r_ID
      l_name
      l_birth_year
      l_zipcode
      r_name
      r_birth_year
      r_zipcode
    
  
  
    
      0
      0
      a2
      b3
      Michael Franklin
      1988.0
      94122.0
      Mike Franklin
      1988
      94122.0
    
    
      1
      1
      a2
      b4
      Michael Franklin
      1988.0
      94122.0
      Joseph Kuan
      1982
      94122.0
    
    
      2
      2
      a2
      b5
      Michael Franklin
      1988.0
      94122.0
      Alfons Kemper
      1984
      94122.0
    
    
      3
      3
      a4
      b3
      Binto George
      1987.0
      94122.0
      Mike Franklin
      1988
      94122.0
    
    
      4
      4
      a4
      b4
      Binto George
      1987.0
      94122.0
      Joseph Kuan
      1982
      94122.0
    
    
      5
      5
      a4
      b5
      Binto George
      1987.0
      94122.0
      Alfons Kemper
      1984
      94122.0
    
    
      6
      6
      a5
      b3
      Alphonse Kemper
      1984.0
      94122.0
      Mike Franklin
      1988
      94122.0
    
    
      7
      7
      a5
      b4
      Alphonse Kemper
      1984.0
      94122.0
      Joseph Kuan
      1982
      94122.0
    
    
      8
      8
      a5
      b5
      Alphonse Kemper
      1984.0
      94122.0
      Alfons Kemper
      1984
      94122.0
    
    
      9
      9
      a3
      b1
      William Bridge
      1986.0
      94107.0
      Mark Levene
      1987
      94107.0
    
    
      10
      10
      a3
      b2
      William Bridge
      1986.0
      94107.0
      Bill Bridge
      1986
      94107.0
    
    
      11
      11
      a3
      b6
      William Bridge
      1986.0
      94107.0
      Michael Brodie
      1987
      94107.0
    
    
      12
      12
      a1
      b1
      Kevin Smith
      NaN
      NaN
      Mark Levene
      1987
      94107.0
    
    
      13
      13
      a1
      b2
      Kevin Smith
      NaN
      NaN
      Bill Bridge
      1986
      94107.0
    
    
      14
      14
      a1
      b3
      Kevin Smith
      NaN
      NaN
      Mike Franklin
      1988
      94122.0
    
    
      15
      15
      a1
      b4
      Kevin Smith
      NaN
      NaN
      Joseph Kuan
      1982
      94122.0
    
    
      16
      16
      a1
      b5
      Kevin Smith
      NaN
      NaN
      Alfons Kemper
      1984
      94122.0
    
    
      17
      17
      a1
      b6
      Kevin Smith
      NaN
      NaN
      Michael Brodie
      1987
      94107.0

The candidate set C2 includes all possible tuple pairs with missing values.

Block a Candidate Set of Tuple Pairs

In the above, we see that the candidate set produced after blocking over input tables include tuple pairs that have different birth years. We will assume that two persons with different birth years cannot refer to the same person. So, we block the candidate set of tuple pairs on birth_year. That is, we block all the tuple pairs that have different birth years.



In [37]:

    
# Instantiate Attr. Equivalence Blocker
ab = em.AttrEquivalenceBlocker()
# Use block_tables to apply blocking over two input tables.
C3 = ab.block_candset(C1, l_block_attr='birth_year', r_block_attr='birth_year')









    



0%           100%
[###############] | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00





    



time: 45.3 ms






    



Total time elapsed: 00:00:00



In [17]:

    
C3.head()









    Out[17]:






  
    
      
      _id
      l_ID
      r_ID
      l_name
      l_birth_year
      l_zipcode
      r_name
      r_birth_year
      r_zipcode
    
  
  
    
      4
      4
      a3
      b2
      William Bridge
      1986
      94107
      Bill Bridge
      1986
      94107
    
    
      6
      6
      a2
      b3
      Michael Franklin
      1988
      94122
      Mike Franklin
      1988
      94122
    
    
      14
      14
      a5
      b5
      Alphonse Kemper
      1984
      94122
      Alfons Kemper
      1984
      94122

Note that, the tuple pairs in the resulting candidate set have the same birth year.

The attributes included in the resulting candidate set are based on the input candidate set (i.e the same attributes are retained).



In [18]:

    
# Show the metadata of C1
em.show_properties(C3)









    



id: 4565765144
fk_rtable: r_ID
rtable(obj.id): 4565204272
fk_ltable: l_ID
key: _id
ltable(obj.id): 4565203432



In [19]:

    
id(A), id(B)









    Out[19]:





(4565203432, 4565204272)

As we saw earlier the metadata of C3 includes the same metadata as C1. That is, it includes key, foreign key to the left and right tables (i.e A and B) and pointers to left and right tables.

Handling Missing Values

If the tuple pairs included in the candidate set have missing values in the blocking attribute, then they are ignored by default. This is because, including all possible tuple pairs with missing values can significantly increase the size of the candidate set. But if you want to include them, then you can set allow_missing paramater to be True.



In [20]:

    
# Display C2 (got by blocking over A1 and B)
C2









    Out[20]:






  
    
      
      _id
      l_ID
      r_ID
      l_name
      l_birth_year
      l_zipcode
      r_name
      r_birth_year
      r_zipcode
    
  
  
    
      0
      0
      a2
      b3
      Michael Franklin
      1988.0
      94122.0
      Mike Franklin
      1988
      94122.0
    
    
      1
      1
      a2
      b4
      Michael Franklin
      1988.0
      94122.0
      Joseph Kuan
      1982
      94122.0
    
    
      2
      2
      a2
      b5
      Michael Franklin
      1988.0
      94122.0
      Alfons Kemper
      1984
      94122.0
    
    
      3
      3
      a4
      b3
      Binto George
      1987.0
      94122.0
      Mike Franklin
      1988
      94122.0
    
    
      4
      4
      a4
      b4
      Binto George
      1987.0
      94122.0
      Joseph Kuan
      1982
      94122.0
    
    
      5
      5
      a4
      b5
      Binto George
      1987.0
      94122.0
      Alfons Kemper
      1984
      94122.0
    
    
      6
      6
      a5
      b3
      Alphonse Kemper
      1984.0
      94122.0
      Mike Franklin
      1988
      94122.0
    
    
      7
      7
      a5
      b4
      Alphonse Kemper
      1984.0
      94122.0
      Joseph Kuan
      1982
      94122.0
    
    
      8
      8
      a5
      b5
      Alphonse Kemper
      1984.0
      94122.0
      Alfons Kemper
      1984
      94122.0
    
    
      9
      9
      a3
      b1
      William Bridge
      1986.0
      94107.0
      Mark Levene
      1987
      94107.0
    
    
      10
      10
      a3
      b2
      William Bridge
      1986.0
      94107.0
      Bill Bridge
      1986
      94107.0
    
    
      11
      11
      a3
      b6
      William Bridge
      1986.0
      94107.0
      Michael Brodie
      1987
      94107.0
    
    
      12
      12
      a1
      b1
      Kevin Smith
      NaN
      NaN
      Mark Levene
      1987
      94107.0
    
    
      13
      13
      a1
      b2
      Kevin Smith
      NaN
      NaN
      Bill Bridge
      1986
      94107.0
    
    
      14
      14
      a1
      b3
      Kevin Smith
      NaN
      NaN
      Mike Franklin
      1988
      94122.0
    
    
      15
      15
      a1
      b4
      Kevin Smith
      NaN
      NaN
      Joseph Kuan
      1982
      94122.0
    
    
      16
      16
      a1
      b5
      Kevin Smith
      NaN
      NaN
      Alfons Kemper
      1984
      94122.0
    
    
      17
      17
      a1
      b6
      Kevin Smith
      NaN
      NaN
      Michael Brodie
      1987
      94107.0



In [21]:

    
em.show_properties(C2)









    



id: 4565567248
fk_rtable: r_ID
rtable(obj.id): 4565204272
fk_ltable: l_ID
key: _id
ltable(obj.id): 4565680928



In [22]:

    
em.show_properties(A1)









    



id: 4565680928
key: ID

We see that A1 is the left table to C2.



In [23]:

    
A1.head()









    Out[23]:






  
    
      
      ID
      name
      birth_year
      hourly_wage
      address
      zipcode
    
  
  
    
      0
      a1
      Kevin Smith
      NaN
      30.0
      607 From St, San Francisco
      NaN
    
    
      1
      a2
      Michael Franklin
      1988.0
      27.5
      1652 Stockton St, San Francisco
      94122.0
    
    
      2
      a3
      William Bridge
      1986.0
      32.0
      3131 Webster St, San Francisco
      94107.0
    
    
      3
      a4
      Binto George
      1987.0
      32.5
      423 Powell St, San Francisco
      94122.0
    
    
      4
      a5
      Alphonse Kemper
      1984.0
      35.0
      1702 Post Street, San Francisco
      94122.0



In [24]:

    
C4 = ab.block_candset(C2, l_block_attr='birth_year', r_block_attr='birth_year', allow_missing=False)









    



0%              100%
[##################] | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00
Total time elapsed: 00:00:00



In [25]:

    
C4









    Out[25]:






  
    
      
      _id
      l_ID
      r_ID
      l_name
      l_birth_year
      l_zipcode
      r_name
      r_birth_year
      r_zipcode
    
  
  
    
      0
      0
      a2
      b3
      Michael Franklin
      1988.0
      94122.0
      Mike Franklin
      1988
      94122.0
    
    
      8
      8
      a5
      b5
      Alphonse Kemper
      1984.0
      94122.0
      Alfons Kemper
      1984
      94122.0
    
    
      10
      10
      a3
      b2
      William Bridge
      1986.0
      94107.0
      Bill Bridge
      1986
      94107.0



In [26]:

    
# Set allow_missing to True
C5 = ab.block_candset(C2, l_block_attr='birth_year', r_block_attr='birth_year', allow_missing=True)









    



0%              100%
[##################] | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00
Total time elapsed: 00:00:00



In [27]:

    
len(C4), len(C5)









    Out[27]:





(3, 9)



In [28]:

    
C5









    Out[28]:






  
    
      
      _id
      l_ID
      r_ID
      l_name
      l_birth_year
      l_zipcode
      r_name
      r_birth_year
      r_zipcode
    
  
  
    
      0
      0
      a2
      b3
      Michael Franklin
      1988.0
      94122.0
      Mike Franklin
      1988
      94122.0
    
    
      8
      8
      a5
      b5
      Alphonse Kemper
      1984.0
      94122.0
      Alfons Kemper
      1984
      94122.0
    
    
      10
      10
      a3
      b2
      William Bridge
      1986.0
      94107.0
      Bill Bridge
      1986
      94107.0
    
    
      12
      12
      a1
      b1
      Kevin Smith
      NaN
      NaN
      Mark Levene
      1987
      94107.0
    
    
      13
      13
      a1
      b2
      Kevin Smith
      NaN
      NaN
      Bill Bridge
      1986
      94107.0
    
    
      14
      14
      a1
      b3
      Kevin Smith
      NaN
      NaN
      Mike Franklin
      1988
      94122.0
    
    
      15
      15
      a1
      b4
      Kevin Smith
      NaN
      NaN
      Joseph Kuan
      1982
      94122.0
    
    
      16
      16
      a1
      b5
      Kevin Smith
      NaN
      NaN
      Alfons Kemper
      1984
      94122.0
    
    
      17
      17
      a1
      b6
      Kevin Smith
      NaN
      NaN
      Michael Brodie
      1987
      94107.0

Block Two tuples To Check If a Tuple Pair Would Get Blocked

We can apply attribute equivalence blocking to a tuple pair to check if it is going to get blocked. For example, we can check if the first tuple from A and B will get blocked if we block on zipcode.



In [29]:

    
# Display the first tuple from table A
A.ix[[0]]









    Out[29]:






  
    
      
      ID
      name
      birth_year
      hourly_wage
      address
      zipcode
    
  
  
    
      0
      a1
      Kevin Smith
      1989
      30.0
      607 From St, San Francisco
      94107



In [30]:

    
# Display the first tuple from table B
B.ix[[0]]









    Out[30]:






  
    
      
      ID
      name
      birth_year
      hourly_wage
      address
      zipcode
    
  
  
    
      0
      b1
      Mark Levene
      1987
      29.5
      108 Clement St, San Francisco
      94107



In [31]:

    
# Instantiate Attr. Equivalence Blocker
ab = em.AttrEquivalenceBlocker()

# Apply blocking to a tuple pair from the input tables on zipcode and get blocking status
status = ab.block_tuples(A.ix[0], B.ix[0], l_block_attr='zipcode', r_block_attr='zipcode')

# Print the blocking status
print(status)









    



False

The above result says that the tuple pair will not be blocked, i.e. this tuple pair will be included in the candidate set.

Currently, block_tuples command does not handle missing values

	ID	name	birth_year	hourly_wage	address	zipcode
0	a1	Kevin Smith	1989	30.0	607 From St, San Francisco	94107
1	a2	Michael Franklin	1988	27.5	1652 Stockton St, San Francisco	94122
2	a3	William Bridge	1986	32.0	3131 Webster St, San Francisco	94107
3	a4	Binto George	1987	32.5	423 Powell St, San Francisco	94122
4	a5	Alphonse Kemper	1984	35.0	1702 Post Street, San Francisco	94122

	ID	name	birth_year	hourly_wage	address	zipcode
0	b1	Mark Levene	1987	29.5	108 Clement St, San Francisco	94107
1	b2	Bill Bridge	1986	32.0	3131 Webster St, San Francisco	94107
2	b3	Mike Franklin	1988	27.5	1652 Stockton St, San Francisco	94122
3	b4	Joseph Kuan	1982	26.0	108 South Park, San Francisco	94122
4	b5	Alfons Kemper	1984	35.0	170 Post St, Apt 4, San Francisco	94122

	_id	l_ID	r_ID	l_name	l_birth_year	l_zipcode	r_name	r_birth_year	r_zipcode
0	0	a2	b3	Michael Franklin	1988.0	94122.0	Mike Franklin	1988	94122.0
1	1	a2	b4	Michael Franklin	1988.0	94122.0	Joseph Kuan	1982	94122.0
2	2	a2	b5	Michael Franklin	1988.0	94122.0	Alfons Kemper	1984	94122.0
3	3	a4	b3	Binto George	1987.0	94122.0	Mike Franklin	1988	94122.0
4	4	a4	b4	Binto George	1987.0	94122.0	Joseph Kuan	1982	94122.0
5	5	a4	b5	Binto George	1987.0	94122.0	Alfons Kemper	1984	94122.0
6	6	a5	b3	Alphonse Kemper	1984.0	94122.0	Mike Franklin	1988	94122.0
7	7	a5	b4	Alphonse Kemper	1984.0	94122.0	Joseph Kuan	1982	94122.0
8	8	a5	b5	Alphonse Kemper	1984.0	94122.0	Alfons Kemper	1984	94122.0
9	9	a3	b1	William Bridge	1986.0	94107.0	Mark Levene	1987	94107.0
10	10	a3	b2	William Bridge	1986.0	94107.0	Bill Bridge	1986	94107.0
11	11	a3	b6	William Bridge	1986.0	94107.0	Michael Brodie	1987	94107.0
12	12	a1	b1	Kevin Smith	NaN	NaN	Mark Levene	1987	94107.0
13	13	a1	b2	Kevin Smith	NaN	NaN	Bill Bridge	1986	94107.0
14	14	a1	b3	Kevin Smith	NaN	NaN	Mike Franklin	1988	94122.0
15	15	a1	b4	Kevin Smith	NaN	NaN	Joseph Kuan	1982	94122.0
16	16	a1	b5	Kevin Smith	NaN	NaN	Alfons Kemper	1984	94122.0
17	17	a1	b6	Kevin Smith	NaN	NaN	Michael Brodie	1987	94107.0